{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Naive Bayes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Vamos criar a classe Naive Bayes para representar o nosso algoritmo.\n",
    "O método **__init__** representa o construtor, inicializando as variáveis do nosso modelo.\n",
    "O modelo gerado é formado basicamente pela frequência das palavras, que em nosso caso, representa os possíveis valores de cada feature e label.\n",
    "\n",
    "* O defaultdict é utilizado para inicializar nosso dicionário com valores default, no caso 0 (int), para chaves que tentamos acessar e ainda não foram adicionadas.\n"
   ]
  },
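  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustration of the defaultdict behavior described above (a toy example of our own, not part of the model):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "freq = defaultdict(int)  # missing keys start at 0\n",
    "for word in ['low', 'high', 'low']:\n",
    "    freq[word] += 1\n",
    "\n",
    "# freq['low'] is 2, freq['high'] is 1, and freq['med'] is 0 (no KeyError)"
   ]
  },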
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "from functools import reduce\n",
    "import math\n",
    "\n",
    "class NaiveBayes:\n",
    "\n",
    "    def __init__(self):\n",
    "\n",
    "        self.freqFeature = defaultdict(int)\n",
    "        self.freqLabel = defaultdict(int)\n",
    "\n",
    "        # condFreqFeature[label][feature]\n",
    "        self.condFreqFeature = defaultdict(lambda: defaultdict(int))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Modelo\n",
    "\n",
    "Como o modelo é representado basicamente pela frequência das palavras, precisamos categorizar os possíveis valores das features. Após esse processo, fazemos a contagem.\n",
    "\n",
    "* countFrequencies: faz a contagem que cada valor de feature e label aparecem em todo o dataset de treino, independentemente.\n",
    "* countCondFrequencies: faz a contagem que cada valor de feature aparece para cada possível label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def countFrequencies(self)    "
   ]
  },
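  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible answer, sketched as a standalone function for illustration (the name and the signature, taking the training data directly, are our assumptions):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "def countFrequenciesSketch(dataSet_x, dataSet_y):\n",
    "    # Count how often each feature value and each label appears,\n",
    "    # independently, over the whole training set.\n",
    "    freqFeature = defaultdict(int)\n",
    "    freqLabel = defaultdict(int)\n",
    "    for features in dataSet_x:\n",
    "        for f in features:\n",
    "            freqFeature[f] += 1\n",
    "    for label in dataSet_y:\n",
    "        freqLabel[label] += 1\n",
    "    return freqFeature, freqLabel"
   ]
  },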
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def countCondFrequencies(self)    "
   ]
  },
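  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Likewise, a standalone sketch of the conditional counting (name and signature are our assumptions), filling the nested table condFreqFeature[label][feature]:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "def countCondFrequenciesSketch(dataSet_x, dataSet_y):\n",
    "    # Count how often each feature value appears for each possible label:\n",
    "    # condFreqFeature[label][feature]\n",
    "    condFreqFeature = defaultdict(lambda: defaultdict(int))\n",
    "    for features, label in zip(dataSet_x, dataSet_y):\n",
    "        for f in features:\n",
    "            condFreqFeature[label][f] += 1\n",
    "    return condFreqFeature"
   ]
  },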
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Treino\n",
    "\n",
    "Vamos treinar o nosso modelos. O que deve ser composto na função de treino?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def train(self, dataSet_x, dataSet_y)"
   ]
  },
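  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One answer, as a hedged sketch: training a Naive Bayes model amounts to filling in the frequency tables from the training data, which can be done in a single pass. Shown here on a minimal copy of the class:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "class NaiveBayesSketch:\n",
    "\n",
    "    def __init__(self):\n",
    "        self.freqFeature = defaultdict(int)\n",
    "        self.freqLabel = defaultdict(int)\n",
    "        self.condFreqFeature = defaultdict(lambda: defaultdict(int))\n",
    "\n",
    "    def train(self, dataSet_x, dataSet_y):\n",
    "        # The model is just the frequency tables filled from the data\n",
    "        for features, label in zip(dataSet_x, dataSet_y):\n",
    "            self.freqLabel[label] += 1\n",
    "            for f in features:\n",
    "                self.freqFeature[f] += 1\n",
    "                self.condFreqFeature[label][f] += 1"
   ]
  },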
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classificação\n",
    "\n",
    "Com o modelos em mãos, agora podemos classificar nosso dataset. Abaixo, segue algumas dicas para tratarmos melhor os dados em nossa função."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def predict(self, dataSet_x):\n",
    "\n",
    "        # Correcao de Laplace\n",
    "        # P( f | l) = (freq( f | l ) + laplace*) / ( freq(l)** + qnt(distinct(f))*** )\n",
    "        #\n",
    "        # * -> laplace smoothing: add 1\n",
    "        # ** -> Frequencia com que o valor de label aparece\n",
    "        # *** -> Quantidade de features distintas\n",
    "        #\n",
    "\n",
    "        # Devido a possibilidade de underflow de pontos flutuantes, eh interessante fazer\n",
    "        # P(x1|l)*P(x2|l) ... -> Log(P(x1|l)) + Log(P(x2|l)) ...\n",
    "        "
   ]
  },
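  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following the hints above, one way to score a single sample (a sketch; the helper name and signature are ours, and predict would apply it to each row of dataSet_x):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import math\n",
    "from collections import defaultdict\n",
    "\n",
    "def scoreSample(features, freqLabel, condFreqFeature, nDistinctFeatures):\n",
    "    # Returns {label: log-score} for one sample, with Laplace smoothing\n",
    "    # and log-probabilities to avoid floating-point underflow.\n",
    "    total = float(sum(freqLabel.values()))\n",
    "    scores = {}\n",
    "    for label, labelFreq in freqLabel.items():\n",
    "        score = math.log(labelFreq / total)  # log P(l)\n",
    "        for f in features:\n",
    "            num = condFreqFeature[label][f] + 1  # Laplace: add 1\n",
    "            den = labelFreq + nDistinctFeatures  # freq(l) + #distinct feature values\n",
    "            score += math.log(num / float(den))  # log P(f | l)\n",
    "        scores[label] = score\n",
    "    return scores"
   ]
  },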
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pré-processamento\n",
    "\n",
    "Abaixo uma função de suporte para a leitura do nosso dataset. Em seguida, um processo de separação dos dados entre dados de treino e teste."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "# Car dataset\n",
    "# Attribute Information:\n",
    "#\n",
    "# Class Values:\n",
    "#\n",
    "# unacc, acc, good, vgood\n",
    "#\n",
    "# Attributes:\n",
    "#\n",
    "# buying: vhigh, high, med, low.\n",
    "# maint: vhigh, high, med, low.\n",
    "# doors: 2, 3, 4, 5more.\n",
    "# persons: 2, 4, more.\n",
    "# lug_boot: small, med, big.\n",
    "# safety: low, med, high.\n",
    "\n",
    "#Retur dataset\n",
    "def readFile(path):\n",
    "    rawDataset = open(path, 'r')\n",
    "\n",
    "\n",
    "    suffix = ['_buy', '_maint', '_doors', '_pers', '_lug', '_safety', '_class']\n",
    "\n",
    "    dataset = []\n",
    "\n",
    "    rawDataset.seek(0)\n",
    "    for line in rawDataset:\n",
    "    \tl = line.split(',')\n",
    "        l[-1] = l[-1].replace(\"\\n\", \"\")\n",
    "        newTuple = map(lambda (x,y): x+y, zip( l , suffix))\n",
    "        dataset.append( newTuple )\n",
    "\n",
    "    return dataset\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "def main():\n",
    "\n",
    "    preparedDataset = readFile('carData.txt')\n",
    "\n",
    "    random.shuffle(preparedDataset)\n",
    "\n",
    "    dataset = []\n",
    "    #Features\n",
    "    dataset.append([])\n",
    "    #Label\n",
    "    dataset.append([])\n",
    "\n",
    "    for t in preparedDataset:\n",
    "        dataset[0].append(t[:-1])\n",
    "        dataset[1].append(t[-1])\n",
    "\n",
    "\n",
    "    dataSet_x = dataset[0]\n",
    "    dataSet_y = dataset[1]\n",
    "\n",
    "    nTuples = len(dataSet_x)\n",
    "\n",
    "    nToTrain = int(nTuples * 0.7)\n",
    "\n",
    "    dataSet_x_train = dataSet_x[:nToTrain]\n",
    "    dataSet_y_train = dataSet_y[:nToTrain]\n",
    "\n",
    "    dataSet_x_test = dataSet_x[nToTrain:]\n",
    "    dataSet_y_test = dataSet_y[nToTrain:]\n",
    "\n",
    "    naive = NaiveBayes()\n",
    "\n",
    "    naive.train(dataSet_x_train, dataSet_y_train)\n",
    "\n",
    "    accuracy = 0.0\n",
    "\n",
    "    results = naive.predict(dataSet_x_test)\n",
    "\n",
    "    for index, r in enumerate(results):\n",
    "        yPredicted = max(r, key=r.get)\n",
    "        y = dataSet_y_test[index]\n",
    "        \n",
    "        if(y == yPredicted):\n",
    "            accuracy += 1.0\n",
    "\n",
    "    print accuracy / len(dataSet_y_test)\n",
    "\n",
    "main()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Atividade\n",
    "\n",
    "#### Perceba que separamos os dados de treino e teste em 70% e 30%, respectivamente. Qual tal testarmos nosso classificador com o processo de Cross-validation? Implemente a abordagem K-fold, com k = 10.\n",
    "\n",
    "Referência: https://www.cs.cmu.edu/~schneide/tut5/node42.html"
   ]
  },
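  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get you started, a sketch of just the fold split (the helper name is ours; the evaluation loop is the exercise):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def kFoldIndices(n, k=10):\n",
    "    # Partition indices 0..n-1 into k contiguous folds of nearly equal size;\n",
    "    # shuffle the dataset beforehand, as main() already does.\n",
    "    foldSize, remainder = divmod(n, k)\n",
    "    folds = []\n",
    "    start = 0\n",
    "    for i in range(k):\n",
    "        end = start + foldSize + (1 if i < remainder else 0)\n",
    "        folds.append(list(range(start, end)))\n",
    "        start = end\n",
    "    return folds"
   ]
  },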
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
